Skip to content

Conversation

@grobinson-grafana
Copy link
Contributor

@grobinson-grafana grobinson-grafana commented Oct 15, 2025

What this PR does / why we need it:

This commit fixes a race condition in the push tracker where both the p.done and p.err channels could have a value at the same time. This could cause the distributor to return success when it should instead have returned an error.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@grobinson-grafana grobinson-grafana force-pushed the grobinson/fix-race-condition-in-tracker branch 8 times, most recently from 5b34fb2 to b3d4ad5 Compare October 16, 2025 12:00
@grobinson-grafana grobinson-grafana marked this pull request as ready for review October 16, 2025 12:57
@grobinson-grafana grobinson-grafana requested a review from a team as a code owner October 16, 2025 12:57
@grobinson-grafana grobinson-grafana force-pushed the grobinson/fix-race-condition-in-tracker branch from b3d4ad5 to 2301021 Compare October 16, 2025 13:48
This commit fixes a race condition in the push tracker where
both the p.done and p.err channels could have a value at the
same time. This could cause the distributor to return success
when it should instead have returned an error.
@grobinson-grafana grobinson-grafana force-pushed the grobinson/fix-race-condition-in-tracker branch from 2301021 to 5dc2cd5 Compare October 16, 2025 13:50
@grobinson-grafana grobinson-grafana added the claude-review Requests a review from Claude Code when added label Nov 13, 2025
@claude
Copy link

claude bot commented Nov 13, 2025

Claude finished @grobinson-grafana's task —— View job


Code Review Complete ✅

I've thoroughly reviewed this PR that fixes a race condition in the push tracker. Here's my detailed analysis:

Review Checklist:

  • Read and understand the changed files
  • Analyze the race condition fix implementation
  • Review test coverage and quality
  • Check for potential bugs or security issues
  • Assess performance implications
  • Provide constructive feedback

Summary

This is an excellent fix that addresses a critical race condition in the distributor's push tracking mechanism. The change replaces an unsafe channel-based approach with a proper mutex-protected state machine.

Race Condition Analysis ⭐

The Problem: The original pushTracker had a fundamental race condition where both p.done and p.err channels could receive values simultaneously:

// OLD CODE - UNSAFE
func (p *pushTracker) doneWithResult(err error) {
    if err == nil {
        if p.streamsPending.Dec() == 0 {
            p.done <- struct{}{}  // ← Could send here
        }
    } else {
        if p.streamsFailed.Inc() == 1 {
            p.err <- err          // ← AND here simultaneously
        }
    }
}

The Wait() method used a select that could receive from either channel, potentially returning success when an error occurred.

The Fix: The new basicPushTracker uses proper synchronization:

  • Mutex protects all state changes
  • Atomic operations replaced by mutex-protected counters
  • Channels used only for signaling, not data transfer
  • Error stored in protected state rather than sent through channel

Code Quality Assessment

✅ Strengths

  1. Thread Safety: Proper mutex usage eliminates the race condition
  2. Clear API: Well-defined PushTracker interface with good documentation
  3. Error Handling: First-error semantics are correctly implemented
  4. Panic Behavior: Consistent with sync.WaitGroup patterns
  5. Performance: Fast-path optimization with done bool field
  6. Resource Management: Proper channel creation/closing

✅ Test Coverage

The test suite is comprehensive and well-designed:

  • Edge Cases: Empty tracker, context cancellation, error propagation
  • Race Conditions: Fuzz testing with concurrent goroutines
  • API Misuse: Proper panic testing for invalid usage patterns
  • Error Semantics: First-error behavior verification
  • Concurrency: Multi-waiter scenarios with randomized timing

The fuzz test is particularly valuable for catching race conditions.

Technical Implementation

Correct Synchronization Pattern

func (t *basicPushTracker) Wait(ctx context.Context) error {
    t.mtx.Lock()
    if t.firstErr != nil || t.n == 0 {
        res := t.firstErr
        t.mtx.Unlock()
        return res  // Fast path - no channel operations needed
    }
    t.mtx.Unlock()
    
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-t.doneCh:
        return t.firstErr  // No mutex needed - immutable after close
    case <-t.errCh:
        return t.firstErr  // No mutex needed - immutable after close
    }
}

This pattern is correct because:

  • State is checked under mutex first (fast path)
  • Channels are used only for blocking/signaling
  • Once channels are closed, firstErr becomes immutable
  • Context cancellation is properly handled

Memory Safety

The implementation correctly handles the fact that once either channel is closed, the firstErr field becomes immutable, eliminating the need for mutex protection in the select branches.

Performance Considerations

Positive impacts:

  • ✅ Fast path avoids channel operations when result is immediately available
  • ✅ Single mutex is more efficient than atomic operations on multiple fields
  • ✅ No unnecessary channel buffer allocations

Neutral impact:

  • The mutex overhead is negligible compared to distributed push operations
  • Channel-based signaling is still used but safely

Security Assessment

  • ✅ No security concerns identified
  • ✅ Proper error propagation prevents information leakage
  • ✅ No resource leaks in error paths
  • ✅ Panic behavior is consistent and documented

Integration Analysis

The changes integrate well with the existing codebase:

  • Interface abstraction allows future tracker implementations
  • Existing call sites require minimal changes
  • Maintains backward compatibility in behavior
  • Proper error propagation to callers

Minor Suggestions

  1. Documentation: Consider adding a comment explaining why both doneCh and errCh are needed (early error termination vs. completion)

  2. Naming: The function comment references [newSimplePushTracker] but should reference [basicPushTracker]

Verdict: APPROVED

This is a high-quality fix that:

  • ✅ Correctly solves the race condition
  • ✅ Maintains existing API semantics
  • ✅ Includes comprehensive tests
  • ✅ Follows Go best practices
  • ✅ Has no performance regressions
  • ✅ Improves system reliability

The implementation demonstrates solid understanding of Go concurrency patterns and the fix directly addresses the core issue described in the PR.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-review Requests a review from Claude Code when added size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants